{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Copyright 2019 Google LLC"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "# https://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Online Prediction with scikit-learn on AI Platform\n",
    "This notebook uses the [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) to create a simple model, train the model, upload the model to Ai Platform, and lastly use the model to make predictions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# How to bring your model to AI Platform\n",
    "Getting your model ready for predictions can be done in 5 steps:\n",
    "1. Save your model to a file\n",
    "1. Upload the saved model to [Google Cloud Storage](https://cloud.google.com/storage)\n",
    "1. Create a model resource on AI Platform\n",
    "1. Create a model version (linking your scikit-learn model)\n",
    "1. Make an online prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Prerequisites\n",
    "Before you jump in, let’s cover some of the different tools you’ll be using to get online prediction up and running on AI Platform. \n",
    "\n",
    "[Google Cloud Platform](https://cloud.google.com/) lets you build and host applications and websites, store data, and analyze data on Google's scalable infrastructure.\n",
    "\n",
    "[AI Platform](https://cloud.google.com/ml-engine/) is a managed service that enables you to easily build machine learning models that work on any type of data, of any size.\n",
    "\n",
    "[Google Cloud Storage](https://cloud.google.com/storage/) (GCS) is a unified object storage for developers and enterprises, from live data serving to data analytics/ML to data archiving.\n",
    "\n",
    "[Cloud SDK](https://cloud.google.com/sdk/) is a command line tool which allows you to interact with Google Cloud products. In order to run this notebook, make sure that Cloud SDK is [installed](https://cloud.google.com/sdk/downloads) in the same environment as your Jupyter kernel.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Part 0: Setup\n",
    "* [Create a project on GCP](https://cloud.google.com/resource-manager/docs/creating-managing-projects)\n",
    "* [Create a Google Cloud Storage Bucket](https://cloud.google.com/storage/docs/quickstart-console)\n",
    "* [Enable AI Platform Training and Prediction and Compute Engine APIs](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component&_ga=2.217405014.1312742076.1516128282-1417583630.1516128282)\n",
    "* [Install Cloud SDK](https://cloud.google.com/sdk/downloads)\n",
    "* [Install scikit-learn](http://scikit-learn.org/stable/install.html)\n",
    "* [Install NumPy](https://docs.scipy.org/doc/numpy/user/install.html)\n",
    "* [Install pandas](https://pandas.pydata.org/pandas-docs/stable/install.html)\n",
    "* [Install Google API Python Client](https://github.com/google/google-api-python-client)\n",
    "\n",
    "These variables will be needed for the following steps.\n",
    "\n",
    "** Replace: **\n",
    "* `PROJECT_ID <YOUR_PROJECT_ID>` - with your project's id. Use the PROJECT_ID that matches your Google Cloud Platform project.\n",
    "* `BUCKET_NAME <YOUR_BUCKET_NAME>` - with the bucket id you created above.\n",
    "* `MODEL_NAME <YOUR_MODEL_NAME>` - with your model name, such as '`census`'\n",
    "* `VERSION <YOUR_VERSION>` - with your version name, such as '`v1`'\n",
    "* `REGION <REGION>` - [select a region](https://cloud.google.com/ml-engine/docs/tensorflow/regions#available_regions) or use the default '`us-central1`'. The region is where the model will be deployed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "env: PROJECT_ID=PROJECT_ID\n",
      "env: BUCKET_NAME=BUCKET_NAME\n",
      "env: MODEL_NAME=census\n",
      "env: VERSION_NAME=v1\n",
      "env: REGION=us-central1\n"
     ]
    }
   ],
   "source": [
    "%env PROJECT_ID PROJECT_ID\n",
    "%env BUCKET_NAME BUCKET_NAME\n",
    "%env MODEL_NAME census\n",
    "%env VERSION_NAME v1\n",
    "%env REGION us-central1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Download the data\n",
    "The [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) that this sample\n",
    "uses for training is hosted by the [UC Irvine Machine Learning\n",
    "Repository](https://archive.ics.uci.edu/ml/datasets/).\n",
    "\n",
    " * Training file is `adult.data`\n",
    " * Evaluation file is `adult.test`\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "### Disclaimer\n",
    "This dataset is provided by a third party. Google provides no representation,\n",
    "warranty, or other guarantees about the validity or any other aspects of this dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mkdir: census_data: File exists\n",
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "100 3881k  100 3881k    0     0  7804k      0 --:--:-- --:--:-- --:--:-- 7824k\n",
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "100 1956k  100 1956k    0     0  6895k      0 --:--:-- --:--:-- --:--:-- 6912k\n"
     ]
    }
   ],
   "source": [
    "# Create a directory to hold the data\n",
    "! mkdir census_data\n",
    "\n",
    "# Download the data\n",
    "! curl https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data --output census_data/adult.data\n",
    "! curl https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test --output census_data/adult.test"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Part 1: Train/Save the model \n",
    "First, the data is loaded into a pandas DataFrame that can be used by scikit-learn. Then a simple model is created and fit against the training data. Lastly, sklearn's built in version of joblib is used to save the model to a file that can be uploaded to AI Platform."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model trained and saved\n"
     ]
    }
   ],
   "source": [
    "import googleapiclient.discovery\n",
    "import json\n",
    "import numpy as np\n",
    "import os\n",
    "import pandas as pd\n",
    "import pickle\n",
    "\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.externals import joblib\n",
    "from sklearn.feature_selection import SelectKBest\n",
    "from sklearn.pipeline import FeatureUnion\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import LabelBinarizer\n",
    "\n",
    "# Define the format of your input data including unused columns (These are the columns from the census data files)\n",
    "COLUMNS = (\n",
    "    'age',\n",
    "    'workclass',\n",
    "    'fnlwgt',\n",
    "    'education',\n",
    "    'education-num',\n",
    "    'marital-status',\n",
    "    'occupation',\n",
    "    'relationship',\n",
    "    'race',\n",
    "    'sex',\n",
    "    'capital-gain',\n",
    "    'capital-loss',\n",
    "    'hours-per-week',\n",
    "    'native-country',\n",
    "    'income-level'\n",
    ")\n",
    "\n",
    "# Categorical columns are columns that need to be turned into a numerical value to be used by scikit-learn\n",
    "CATEGORICAL_COLUMNS = (\n",
    "    'workclass',\n",
    "    'education',\n",
    "    'marital-status',\n",
    "    'occupation',\n",
    "    'relationship',\n",
    "    'race',\n",
    "    'sex',\n",
    "    'native-country'\n",
    ")\n",
    "\n",
    "\n",
    "# Load the training census dataset\n",
    "with open('./census_data/adult.data', 'r') as train_data:\n",
    "    raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)\n",
    "\n",
    "# Remove the column we are trying to predict ('income-level') from our features list\n",
    "# Convert the Dataframe to a lists of lists\n",
    "train_features = raw_training_data.drop('income-level', axis=1).as_matrix().tolist()\n",
    "# Create our training labels list, convert the Dataframe to a lists of lists\n",
    "train_labels = (raw_training_data['income-level'] == ' >50K').as_matrix().tolist()\n",
    "\n",
    "\n",
    "# Load the test census dataset\n",
    "with open('./census_data/adult.test', 'r') as test_data:\n",
    "    raw_testing_data = pd.read_csv(test_data, names=COLUMNS, skiprows=1)\n",
    "# Remove the column we are trying to predict ('income-level') from our features list\n",
    "# Convert the Dataframe to a lists of lists\n",
    "test_features = raw_testing_data.drop('income-level', axis=1).values.tolist()\n",
    "# Create our training labels list, convert the Dataframe to a lists of lists\n",
    "test_labels = (raw_testing_data['income-level'] == ' >50K.').values.tolist()\n",
    "\n",
    "\n",
    "# Since the census data set has categorical features, we need to convert\n",
    "# them to numerical values. We'll use a list of pipelines to convert each\n",
    "# categorical column and then use FeatureUnion to combine them before calling\n",
    "# the RandomForestClassifier.\n",
    "categorical_pipelines = []\n",
    "\n",
    "# Each categorical column needs to be extracted individually and converted to a numerical value.\n",
    "# To do this, each categorical column will use a pipeline that extracts one feature column via\n",
    "# SelectKBest(k=1) and a LabelBinarizer() to convert the categorical value to a numerical one.\n",
    "# A scores array (created below) will select and extract the feature column. The scores array is\n",
    "# created by iterating over the COLUMNS and checking if it is a CATEGORICAL_COLUMN.\n",
    "for i, col in enumerate(COLUMNS[:-1]):\n",
    "    if col in CATEGORICAL_COLUMNS:\n",
    "        # Create a scores array to get the individual categorical column.\n",
    "        # Example:\n",
    "        #  data = [39, 'State-gov', 77516, 'Bachelors', 13, 'Never-married', 'Adm-clerical', \n",
    "        #         'Not-in-family', 'White', 'Male', 2174, 0, 40, 'United-States']\n",
    "        #  scores = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n",
    "        #\n",
    "        # Returns: [['State-gov']]\n",
    "        # Build the scores array.\n",
    "        scores = [0] * len(COLUMNS[:-1]) \n",
    "        # This column is the categorical column we want to extract.\n",
    "        scores[i] = 1\n",
    "        skb = SelectKBest(k=1)\n",
    "        skb.scores_ = scores\n",
    "        # Convert the categorical column to a numerical value\n",
    "        lbn = LabelBinarizer()\n",
    "        r = skb.transform(train_features)\n",
    "        lbn.fit(r)\n",
    "        # Create the pipeline to extract the categorical feature\n",
    "        categorical_pipelines.append(\n",
    "            ('categorical-{}'.format(i), Pipeline([\n",
    "                ('SKB-{}'.format(i), skb),\n",
    "                ('LBN-{}'.format(i), lbn)])))\n",
    "\n",
    "# Create pipeline to extract the numerical features\n",
    "skb = SelectKBest(k=6)\n",
    "# From COLUMNS use the features that are numerical\n",
    "skb.scores_ = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0]\n",
    "categorical_pipelines.append(('numerical', skb))\n",
    "\n",
    "# Combine all the features using FeatureUnion\n",
    "preprocess = FeatureUnion(categorical_pipelines)\n",
    "\n",
    "# Create the classifier\n",
    "classifier = RandomForestClassifier()\n",
    "\n",
    "# Transform the features and fit them to the classifier\n",
    "classifier.fit(preprocess.transform(train_features), train_labels)\n",
    "\n",
    "# Create the overall model as a single pipeline\n",
    "pipeline = Pipeline([\n",
    "    ('union', preprocess),\n",
    "    ('classifier', classifier)\n",
    "])\n",
    "\n",
    "# Export the model to a file\n",
    "joblib.dump(pipeline, 'model.joblib')\n",
    "\n",
    "print('Model trained and saved')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Part 2: Upload the model\n",
    "Next, you'll need to upload the model to your project's storage bucket in GCS. To use your model with AI Platform, it needs to be uploaded to Google Cloud Storage (GCS). This step takes your local ‘model.joblib’ file and uploads it GCS via the Cloud SDK using gsutil.\n",
    "\n",
    "Before continuing, make sure you're [properly authenticated](https://cloud.google.com/sdk/gcloud/reference/auth/) and have [access to the bucket](https://cloud.google.com/storage/docs/access-control/). This next command sets your project to the one specified above.\n",
    "\n",
    "Note: If you get an error below, make sure the Cloud SDK is installed in the kernel's environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Updated property [core/project].\r\n"
     ]
    }
   ],
   "source": [
    "! gcloud config set project $PROJECT_ID"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Note: The exact file name of of the exported model you upload to GCS is important! Your model must be named  “model.joblib”, “model.pkl”, or “model.bst” with respect to the library you used to export it. This restriction ensures that the model will be safely reconstructed later by using the same technique for import as was used during export."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Copying file://./model.joblib [Content-Type=application/octet-stream]...\n",
      "| [1 files][  7.8 MiB/  7.8 MiB]                                                \n",
      "Operation completed over 1 objects/7.8 MiB.                                      \n"
     ]
    }
   ],
   "source": [
    "! gsutil cp ./model.joblib gs://$BUCKET_NAME/model.joblib"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Part 3: Create a model resource\n",
    "AI Platform organizes your trained models using model and version resources. An AI Platform model is a container for the versions of your machine learning model. For more information on model resources and model versions look [here](https://cloud.google.com/ml-engine/docs/deploying-models#creating_a_model_version). \n",
    "\n",
    "At this step, you create a container that you can use to hold several different versions of your actual model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Created AI Platform model [projects/PROJECT_ID/models/census].\r\n"
     ]
    }
   ],
   "source": [
    "! gcloud ml-engine models create $MODEL_NAME --regions $REGION"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Part 4: Create a model version\n",
    "\n",
    "Now it’s time to get your model online and ready for predictions. The model version requires a few components as specified [here](https://cloud.google.com/ml-engine/reference/rest/v1/projects.models.versions#Version).\n",
    "\n",
    "* __name__ - The name specified for the version when it was created. This will be the `VERSION_NAME` variable you declared at the beginning.\n",
    "* __model__ - The name of the model container we created in Part 3. This is the `MODEL_NAME` variable you declared at the beginning.\n",
    "* __deployment Uri__ - The Google Cloud Storage location of the trained model used to create the version. This is the bucket that you uploaded the model to with your `BUCKET_NAME`\n",
    "* __runtime version__ - [Select Google Cloud ML runtime version](https://cloud.google.com/ml-engine/docs/tensorflow/runtime-version-list) to use for this deployment. This is set to 1.4\n",
    "* __framework__ - The framework specifies if you are using: `TENSORFLOW`, `SCIKIT_LEARN`, `XGBOOST`. This is set to `SCIKIT_LEARN`\n",
    "* __pythonVersion__ - This specifies whether you’re using Python 2.7 or Python 3.5. The default value is set to `“2.7”`, if you are using Python 3.5, set the value to `“3.5”`\n",
    "\n",
    "Note: If you require a feature of scikit-learn that isn’t available in the publicly released version yet, you can specify “runtimeVersion”: “HEAD” instead, and that would get the latest version of scikit-learn available from the github repo. Otherwise the following versions will be used:\n",
    "* scikit-learn: 0.19.0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "First, we need to create a YAML file to configure our model version.\n",
    "\n",
    "__REPLACE:__ `PREVIOUSLY_SPECIFIED_BUCKET_NAME` with your `BUCKET_NAME`\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Writing ./config.yaml\n"
     ]
    }
   ],
   "source": [
    "%%writefile ./config.yaml\n",
    "deploymentUri: \"gs://BUCKET_NAME/\"\n",
    "runtimeVersion: '1.4'\n",
    "framework: \"SCIKIT_LEARN\"\n",
    "pythonVersion: \"3.5\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Use the created YAML file to create a model version.\n",
    "\n",
    "Note: It can take several minutes for you model to be available.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Creating version (this might take a few minutes)......done.                    \n"
     ]
    }
   ],
   "source": [
    "! gcloud ml-engine versions create $VERSION_NAME \\\n",
    "    --model $MODEL_NAME \\\n",
    "    --config config.yaml"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Part 5: Make an online prediction\n",
    "It’s time to make an online prediction with your newly deployed model. Before you begin, you'll need to take some of the test data and prepare it, so that the test data can be used by the deployed model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Show a person that makes <=50K:\n",
      "\tFeatures: [25, ' Private', 226802, ' 11th', 7, ' Never-married', ' Machine-op-inspct', ' Own-child', ' Black', ' Male', 0, 0, 40, ' United-States'] --> Label: False\n",
      "\n",
      "Show a person that makes >50K:\n",
      "\tFeatures: [44, ' Private', 160323, ' Some-college', 10, ' Married-civ-spouse', ' Machine-op-inspct', ' Husband', ' Black', ' Male', 7688, 0, 40, ' United-States'] --> Label: True\n"
     ]
    }
   ],
   "source": [
    "# Get one person that makes <=50K and one that makes >50K to test our model.\n",
    "print('Show a person that makes <=50K:')\n",
    "print('\\tFeatures: {0} --> Label: {1}\\n'.format(test_features[0], test_labels[0]))\n",
    "\n",
    "with open('less_than_50K.json', 'w') as outfile:\n",
    "  json.dump(test_features[0], outfile)\n",
    "\n",
    "  \n",
    "print('Show a person that makes >50K:')\n",
    "print('\\tFeatures: {0} --> Label: {1}'.format(test_features[3], test_labels[3]))\n",
    "\n",
    "with open('more_than_50K.json', 'w') as outfile:\n",
    "  json.dump(test_features[3], outfile)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Use gcloud to make online predictions\n",
    "Use the two people (as seen in the table) gathered in the previous step for the gcloud predictions.\n",
    "\n",
    "| **Person** | age | workclass | fnlwgt | education | education-num | marital-status | occupation |\n",
    "|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:\n",
    "| **1** | 25| Private | 226802 | 11th | 7 | Never-married | Machine-op-inspect |\n",
    "| **2** | 44| Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct |\n",
    "\n",
    "| **Person** | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country || (Label) income-level|\n",
    "|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:||:-:\n",
    "| **1** | Own-child | Black | Male | 0 | 0 | 40 | United-States || False (<=50K) |\n",
    "| **2** | Huasband | Black | Male | 7688 | 0 | 40 | United-States || True (>50K) |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Test the model with an online prediction using the data of a person who makes <=50K.\n",
    "\n",
    "Note: If you see an error, the model from Part 4 may not be created yet as it takes several minutes for a new model version to be created."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[False]\r\n"
     ]
    }
   ],
   "source": [
    "! gcloud ml-engine predict --model $MODEL_NAME --version $VERSION_NAME --json-instances less_than_50K.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Test the model with an online prediction using the data of a person who makes >50K."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[True]\r\n"
     ]
    }
   ],
   "source": [
    "! gcloud ml-engine predict --model $MODEL_NAME --version $VERSION_NAME --json-instances more_than_50K.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Use Python to make online predictions\n",
    "Test the model with the entire test set and print out some of the results.\n",
    "\n",
    "Note: If running notebook server on Compute Engine, make sure to [\"allow full access to all Cloud APIs\".](https://cloud.google.com/compute/docs/access/create-enable-service-accounts-for-instances#changeserviceaccountandscopes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Prediction: False\tLabel: False\n",
      "Prediction: False\tLabel: False\n",
      "Prediction: False\tLabel: True\n",
      "Prediction: True\tLabel: True\n",
      "Prediction: False\tLabel: False\n",
      "Prediction: False\tLabel: False\n",
      "Prediction: False\tLabel: False\n",
      "Prediction: False\tLabel: True\n",
      "Prediction: False\tLabel: False\n",
      "Prediction: False\tLabel: False\n"
     ]
    }
   ],
   "source": [
    "import googleapiclient.discovery\n",
    "import os\n",
    "import pandas as pd\n",
    "\n",
    "PROJECT_ID = os.environ['PROJECT_ID']\n",
    "VERSION_NAME = os.environ['VERSION_NAME']\n",
    "MODEL_NAME = os.environ['MODEL_NAME']\n",
    "\n",
    "service = googleapiclient.discovery.build('ml', 'v1')\n",
    "name = 'projects/{}/models/{}'.format(PROJECT_ID, MODEL_NAME)\n",
    "name += '/versions/{}'.format(VERSION_NAME)\n",
    "\n",
    "# Due to the size of the data, it needs to be split in 2\n",
    "first_half = test_features[:int(len(test_features)/2)]\n",
    "second_half = test_features[int(len(test_features)/2):]\n",
    "\n",
    "complete_results = []\n",
    "for data in [first_half, second_half]:\n",
    "    responses = service.projects().predict(\n",
    "        name=name,\n",
    "        body={'instances': data}\n",
    "    ).execute()\n",
    "\n",
    "    if 'error' in responses:\n",
    "        print(response['error'])\n",
    "    else:\n",
    "        complete_results.extend(responses['predictions'])\n",
    "        \n",
    "# Print the first 10 responses\n",
    "for i, response in enumerate(complete_results[:10]):\n",
    "    print('Prediction: {}\\tLabel: {}'.format(response, test_labels[i]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# [Optional] Part 6: Verify Results\n",
    "Use a confusion matrix to create a visualization of the online predicted results from AI Platform."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>online</th>\n",
       "      <th>False</th>\n",
       "      <th>True</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>actual</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>False</th>\n",
       "      <td>11606</td>\n",
       "      <td>829</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>1644</td>\n",
       "      <td>2202</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "online  False  True \n",
       "actual              \n",
       "False   11606    829\n",
       "True     1644   2202"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "actual = pd.Series(test_labels, name='actual')\n",
    "online = pd.Series(complete_results, name='online')\n",
    "\n",
    "pd.crosstab(actual,online)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Use a confusion matrix create a visualization of the predicted results from the local model. These results should be identical to the results above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>local</th>\n",
       "      <th>False</th>\n",
       "      <th>True</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>actual</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>False</th>\n",
       "      <td>11606</td>\n",
       "      <td>829</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>1644</td>\n",
       "      <td>2202</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "local   False  True \n",
       "actual              \n",
       "False   11606    829\n",
       "True     1644   2202"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "local_results = pipeline.predict(test_features)\n",
    "local = pd.Series(local_results, name='local')\n",
    "\n",
    "pd.crosstab(actual,local)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Directly compare the two results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "identical: 16281, different: 0\n"
     ]
    }
   ],
   "source": [
    "identical = 0\n",
    "different = 0\n",
    "\n",
    "for i in range(len(complete_results)):\n",
    "    if complete_results[i] == local_results[i]:\n",
    "        identical += 1\n",
    "    else:\n",
    "        different += 1\n",
    "        \n",
    "print('identical: {}, different: {}'.format(identical,different))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "If all results are identical, it means you've successfully uploaded your local model to AI Platform and performed online predictions correctly."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
